Abstract

In most adaptive signal processing applications, system linearity is assumed and adaptive linear filters are thus used. The traditional class of supervised adaptive filters relies on error-correction learning for its adaptive capability. The kernel method is a powerful nonparametric modeling tool for pattern analysis and statistical signal processing. Through a nonlinear mapping, kernel methods transform the data into a set of points in a Reproducing Kernel Hilbert Space (RKHS). The kernel recursive least squares (KRLS) algorithm achieves high accuracy and converges quickly in stationary scenarios. However, this good performance comes at the cost of high computational complexity. Sparsification in kernel methods is known to reduce computational complexity and memory consumption.


1 Linear And Nonlinear Adaptive Filter 

The term filter usually refers to a system that is designed to extract information about a prescribed quantity of interest from noisy data. An adaptive filter is a filter that self-adjusts its input-output mapping according to an optimization algorithm, such as predictive coding [2], and is mostly driven by an error signal. Because of the complexity of the optimization algorithms, most adaptive filters are digital filters. With the processing capabilities of current digital signal processors, adaptive filters have become much more popular and are now widely used in fields such as sound-wave-based communication devices [23], face extraction [24], camcorders and digital cameras, information retrieval [25], and medical monitoring equipment.

1.1 Linear Adaptive Filter 

In most adaptive signal processing applications, system linearity is assumed and adaptive linear filters are thus used. The traditional class of supervised adaptive filters relies on error-correction learning for its adaptive capability. The error is therefore a necessary element of the cost function, which is the criterion for optimum performance of the filter.

The linear filter includes a set of adaptively adjustable parameters (also known as weights), denoted $\omega(n-1)$, where $n$ denotes discrete time. The input signal $u(n)$ applied to the filter at time $n$ produces the actual response $y(n)$ via

$y(n) = \omega(n-1)^T u(n)$

This actual response is then compared with the corresponding desired response $d(n)$ to produce the error signal $e(n)$. The error signal, in turn, acts as a guide to adjust the weights $\omega(n-1)$ by an incremental value denoted by $\Delta\omega(n)$. On the next iteration, $\omega(n)$ becomes the latest value of the weights to be updated. The adaptive filtering process is repeated until the filter reaches a stop condition, normally that the weight adjustment is small enough.
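The loop described above can be sketched in a few lines. The text does not fix a rule for the increment $\Delta\omega(n)$; the sketch below assumes the classical LMS choice $\Delta\omega(n) = \mu\, e(n)\, u(n)$, and the names `lms_filter`, `step_size` and the synthetic data are illustrative assumptions rather than part of the original.

```python
import numpy as np

def lms_filter(U, d, step_size=0.05):
    """Adaptive linear filter; the LMS increment is an assumed update rule."""
    n_samples, order = U.shape
    w = np.zeros(order)                      # initial weights omega(0)
    errors = np.empty(n_samples)
    for n in range(n_samples):
        y = w @ U[n]                         # y(n) = omega(n-1)^T u(n)
        e = d[n] - y                         # e(n) = d(n) - y(n)
        w = w + step_size * e * U[n]         # omega(n) = omega(n-1) + Delta omega(n)
        errors[n] = e
    return w, errors

# Synthetic system identification: recover a hypothetical 3-tap linear system.
rng = np.random.default_rng(0)
w_true = np.array([0.5, -0.3, 0.2])
U = rng.standard_normal((2000, 3))
d = U @ w_true + 0.01 * rng.standard_normal(2000)
w_hat, errors = lms_filter(U, d)
```

A fixed number of samples stands in for the stop condition here; in practice one could instead monitor the size of the weight adjustment.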

An important issue in adaptive design, for both linear and nonlinear adaptive filters, is to ensure that the learning curve converges as the number of iterations increases. Under this condition, the system is said to be in steady state.

1.2 Nonlinear Adaptive Filter 

Even though linear adaptive filters can approximate nonlinearity, their performance is not satisfactory in applications where nonlinearities are significant. Hence more advanced nonlinear models are required. At present, neural networks and kernel adaptive filters are popular nonlinear models. By providing linearity in a high-dimensional feature space, the Reproducing Kernel Hilbert Space (RKHS), and universal approximation in Euclidean space with universal kernels [1], kernel adaptive filters are attracting more attention. Through a reproducing kernel, kernel adaptive filters map data from an input space to an RKHS, where appropriate linear methods are applied to the transformed data. This procedure implements a nonlinear treatment of the data in the input space. Compared with other nonlinear techniques, kernel adaptive filters have the following features:

• They can be universal approximators whenever the kernel is universal.

• They have no local minima with the squared error cost function.

• They have moderate complexity in terms of computation and memory.

• They belong to the class of online learning methods [15] and have good tracking ability to handle nonstationary conditions.

The details of kernel adaptive filters are introduced next.


2 Reproducing Kernel Hilbert Space 


A Reproducing Kernel Hilbert Space, RKHS for short, is a complete inner product space associated with a Mercer kernel. A Mercer kernel is a continuous, symmetric and positive definite function $\kappa : U \times U \to \mathbb{R}$, where $U \subseteq \mathbb{R}^L$ is the input domain in Euclidean space ($L$ is the input order). Commonly used kernels include the Gaussian kernel (1) and the polynomial kernel (2) [1].

$\kappa(u, u') = \exp\left(-\frac{\|u - u'\|^2}{2\sigma^2}\right)$ (1)

$\kappa(u, u') = (u^T u' + 1)^p$ (2)

If only one argument of the kernel function is fixed, $\kappa(u, \cdot)$ can be expressed as a transformed feature vector $\varphi(u)$ through a mapping $\varphi : U \to H$, where $H$ is the RKHS. Therefore $\varphi(u) = \kappa(u, \cdot)$. One of the most important properties of the RKHS for practical applications is the so-called “kernel trick”

$\varphi(u)^T \varphi(u') = \kappa(u, u')$ (3)

which allows the inner product between two RKHS functions to be computed as a scalar kernel evaluation in the input space.
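As a small illustration of the kernel trick, the sketch below evaluates both kernels and checks, for the polynomial kernel with $p = 2$ and two-dimensional inputs, that an explicit feature map gives the same inner product. The Gaussian parameterization with $2\sigma^2$ in the denominator and the helper `poly2_features` are assumptions made for this example.

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian kernel; the 2*sigma^2 scaling is an assumed parameterization."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(np.exp(-(diff @ diff) / (2.0 * sigma ** 2)))

def polynomial_kernel(u, v, p=2):
    """Polynomial kernel (u^T v + 1)^p."""
    return float((np.dot(u, v) + 1.0) ** p)

def poly2_features(u):
    """Explicit feature map of the p = 2 polynomial kernel for 2-D inputs."""
    u1, u2 = u
    r2 = np.sqrt(2.0)
    return np.array([1.0, r2 * u1, r2 * u2, u1 * u1, u2 * u2, r2 * u1 * u2])

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])
explicit = float(poly2_features(u) @ poly2_features(v))   # inner product in feature space
implicit = polynomial_kernel(u, v, p=2)                   # one kernel evaluation
```

For the Gaussian kernel the feature space is infinite-dimensional, so only the implicit evaluation is practical.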

Besides the “kernel trick”, some other properties of the RKHS related to this work are as follows. Let $H$ be any RKHS of all real-valued functions of $u$ that are generated by $\kappa(u, \cdot)$. Suppose two functions $h(\cdot)$ and $g(\cdot)$ are picked from the space $H$ and are respectively represented by

$h(\cdot) = \sum_{i=1}^{l} a_i \kappa(c_i, \cdot) = \sum_{i=1}^{l} a_i \varphi(c_i)$

$g(\cdot) = \sum_{j=1}^{m} b_j \kappa(\tilde{c}_j, \cdot) = \sum_{j=1}^{m} b_j \varphi(\tilde{c}_j)$

where $a_i$ and $b_j$ are the expansion coefficients and both $c_i$ and $\tilde{c}_j \in U$ for all $i$ and $j$.


1. Symmetry

$\langle h, g \rangle = \langle g, h \rangle$

2. Scaling and distributive property

$\langle (cf + dg), h \rangle = c \langle f, h \rangle + d \langle g, h \rangle$

3. Squared norm

$\|h\|^2 = \langle h, h \rangle \geq 0$
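For functions given as kernel expansions, the kernel trick and bilinearity give $\langle h, g \rangle = \sum_i \sum_j a_i b_j \kappa(c_i, \tilde{c}_j)$. The following sketch evaluates this double sum as a matrix product and lets the symmetry and norm properties be checked numerically. It assumes a Gaussian kernel with unit width; `gauss_gram` and `rkhs_inner` are illustrative names.

```python
import numpy as np

def gauss_gram(C1, C2, sigma=1.0):
    """Matrix with entries kappa(c_i, c~_j) for the rows of C1 and C2."""
    sq = ((C1[:, None, :] - C2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def rkhs_inner(a, C_h, b, C_g, sigma=1.0):
    """<h, g> for h = sum_i a_i kappa(c_i, .) and g = sum_j b_j kappa(c~_j, .)."""
    return float(a @ gauss_gram(C_h, C_g, sigma) @ b)

rng = np.random.default_rng(0)
a, C_h = rng.standard_normal(4), rng.standard_normal((4, 2))   # expansion of h
b, C_g = rng.standard_normal(3), rng.standard_normal((3, 2))   # expansion of g
hg = rkhs_inner(a, C_h, b, C_g)
gh = rkhs_inner(b, C_g, a, C_h)
hh = rkhs_inner(a, C_h, a, C_h)      # squared norm of h
```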

3 Kernel Adaptive Filtering


The kernel method is a powerful nonparametric modeling tool for pattern analysis and statistical signal processing. Through a nonlinear mapping, kernel methods transform the data into a set of points in an RKHS. Various methods are then used to find relationships between the data. Successful examples of this methodology include support vector machines (SVM) [5], kernel regularization networks [6], kernel principal component analysis (kernel PCA) [7] and kernel Fisher discriminant analysis [8]. Kernel adaptive filters are a class of kernel methods. Within this class, the kernel affine projection algorithms (KAPA) include the kernel least mean square (KLMS) algorithm [10] as the simplest element and kernel recursive least squares (KRLS) [9] as the most computationally demanding.

The main idea of kernel adaptive filtering can be summarized as follows: transform the input data into a high-dimensional feature space $F$ via a Mercer kernel, then apply appropriate linear methods to the transformed data. As long as a linear method in the feature space can be formulated in terms of inner products, the “kernel trick” removes the need to do computation in the high-dimensional space. It has been proved that kernel adaptive filters with a universal kernel have the universal approximation property, i.e., for any continuous input-output mapping $f : U \to \mathbb{R}$ and $\forall \varepsilon > 0$, $\exists \{u(i)\}_{i \in \mathbb{N}} \subseteq U$ and real numbers $\{c_i\}_{i \in \mathbb{N}}$ such that $\|f - \sum_{i} c_i \kappa(u(i), \cdot)\| < \varepsilon$. This universal approximation property guarantees that the kernel method is capable of superior performance in nonlinear tasks. If we express a vector in $F$ as

$\Omega = \sum_{i} c_i \varphi(u(i))$ (4)

we obtain

$\|f - \Omega^T \varphi\| < \varepsilon$ (5)

Furthermore, from the viewpoint of supervised learning, which requires the availability of a collection of desired responses, error cost functions (or error criteria) play a significant role. Kernel adaptive filters generalize linear adaptive filters, because the latter are a special case of the former when expressed in the dual space. Kernel adaptive filters exhibit a growing radial basis function network, learning the network topology and adapting the free parameters directly from the data at the same time [1]. In the remainder of this section, KRLS and KLMS are introduced in turn.

Consider the learning of a nonlinear function $f : U \to \mathbb{R}$ based on a known sequence $(u(1), d(1)), (u(2), d(2)), \ldots, (u(n), d(n))$, where $U \subseteq \mathbb{R}^L$ is the input space, $u(i)$, $i = 1, \ldots, n$, is the system input at sample time $i$, and $d(i)$ is the corresponding desired response.

3.1 Kernel Recursive Least Square Algorithm 

KRLS is the recursive least squares (RLS) algorithm in the RKHS. At each iteration, one needs to solve the regularized least squares regression problem to obtain $f$:

$\min_{f} \sum_{i=1}^{n} \|f(u(i)) - d(i)\|^2 + \lambda \|f\|_H^2$ (6)

where $\lambda$ is the regularization parameter and $\|\cdot\|_H$ denotes the norm in $H$.

This problem can alternatively be solved in a feature space, which results in KRLS. The learning problem of KRLS with regularization in the feature space $F$ is to find a high-dimensional weight vector $\Omega \in F$ that minimizes

$\min_{\Omega} \sum_{i=1}^{n} \|\Omega^T \varphi(u(i)) - d(i)\|^2 + \lambda \|\Omega\|_F^2$ (7)

where $\|\cdot\|_F$ denotes the norm in $F$.
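Before turning to the recursive solution, it may help to see the batch problem solved directly. The text does not state the closed form; the sketch below relies on the standard representer-theorem result that the minimizer can be written as $f = \sum_i \alpha_i \kappa(u(i), \cdot)$ with $\alpha = (\mathbf{K} + \lambda I)^{-1} \mathbf{d}$, where $\mathbf{K}$ is the Gram matrix of the inputs. The Gaussian kernel, the function names and the sine-wave data are assumptions for illustration.

```python
import numpy as np

def gauss_gram(X, Y, sigma=1.0):
    """Gram matrix of the Gaussian kernel between the rows of X and Y."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def fit_batch(U, d, lam=1e-3, sigma=1.0):
    """Solve (K + lam*I) alpha = d, the batch regularized least squares solution."""
    K = gauss_gram(U, U, sigma)
    return np.linalg.solve(K + lam * np.eye(len(U)), d)

def predict(U_train, alpha, U_new, sigma=1.0):
    """Evaluate f(u) = sum_i alpha_i kappa(u(i), u) at the rows of U_new."""
    return gauss_gram(U_new, U_train, sigma) @ alpha

U = np.linspace(-3.0, 3.0, 50)[:, None]      # hypothetical 1-D inputs
d = np.sin(U[:, 0])                          # hypothetical desired responses
alpha = fit_batch(U, d)
fit_error = np.max(np.abs(predict(U, alpha, U) - d))
```

Re-solving this system for every new sample is what makes a recursive formulation attractive.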

3.1.1 Approximate linear dependency for sparsification

Similar to RLS, KRLS achieves high accuracy and converges quickly in stationary scenarios. However, this good performance comes at the cost of high computational complexity, $O(n^2)$, where $n$ is the number of processed samples.

To decrease the computational complexity of KRLS, sparsification techniques are adopted. Specifically, sparsification in kernel methods reduces computational complexity and memory consumption. Furthermore, sparsification also influences the generalization ability of machine learning algorithms and signal processing systems [1]. The Novelty Criterion, Approximate Linear Dependency (ALD), Prediction Variance, Surprise and Quantized techniques are common sparsification strategies. Among them, ALD is an effective sparsification technique for KRLS because it is solved in the feature space, unlike most of the other techniques. Before introducing ALD, several concepts should be defined:

Network size: the number of data samples used to describe the system model.


Center c: a data sample used to build the system. The number of centers therefore equals the network size.


Center dictionary C: the set of all centers c.


Suppose that after having observed $n-1$ training samples, we have established a center dictionary $C(n-1) = \{c_i\}_{i=1}^{K(n-1)}$, where $K(n-1)$ is the cardinality of $C(n-1)$. When a new sample $\{u(n), d(n)\}$ is presented, ALD tests whether there exists a coefficient vector $a(n) = (a_1, \ldots, a_{K(n-1)})^T$ satisfying

$\delta_2 = \min_{a} \left\| \sum_{i=1}^{K(n-1)} a_i \varphi(c_i) - \varphi(u(n)) \right\|^2 \leq \delta$ (8)

where $\delta$, called the approximation level, is the threshold determining the sparsification level as well as the system accuracy. If this condition is satisfied, the new sample can be linearly approximated in the feature space by the centers in $C(n-1)$. The effect of this sample on the mapping can therefore be expressed through the existing centers, and there is no need to augment the center dictionary. Otherwise, the new sample, whose feature vector is not approximately dependent on the existing centers, is added to the dictionary and, consequently, a new coefficient corresponding to this center is included. A straightforward calculation gives

$a(n) = \mathbf{K}(n-1)^{-1} h(n)$ (9)

$\delta_2 = \kappa(u(n), u(n)) - h(n)^T a(n)$ (10)

where $h(n) = [\kappa(c_1, u(n)), \ldots, \kappa(c_{K(n-1)}, u(n))]^T$ and the matrix $[\mathbf{K}(n-1)]_{ij} = \kappa(c_i, c_j)$. ALD is not only an effective approach to sparsification but also improves the overall stability of the algorithm because of its relation to the eigenvalues of $\mathbf{K}$ [1].
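A minimal sketch of the ALD test in Eqs. (8)-(10) follows. It assumes a Gaussian kernel of unit width and rebuilds the dictionary Gram matrix on every call for clarity, which an efficient implementation would avoid; `ald_test` and its arguments are illustrative names.

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(np.exp(-(diff @ diff) / (2.0 * sigma ** 2)))

def ald_test(centers, u_new, delta, sigma=1.0):
    """Return (is_dependent, a, delta2) for a candidate sample u_new."""
    K = np.array([[gaussian_kernel(ci, cj, sigma) for cj in centers] for ci in centers])
    h = np.array([gaussian_kernel(ci, u_new, sigma) for ci in centers])
    a = np.linalg.solve(K, h)                                  # Eq. (9)
    delta2 = gaussian_kernel(u_new, u_new, sigma) - h @ a      # Eq. (10)
    return bool(delta2 <= delta), a, float(delta2)

centers = [np.array([0.0]), np.array([1.0])]                   # hypothetical dictionary
near_dep, near_a, near_d2 = ald_test(centers, np.array([0.0]), delta=0.1)
far_dep, far_a, far_d2 = ald_test(centers, np.array([5.0]), delta=0.1)
```

A sample that coincides with an existing center gives $\delta_2 \approx 0$ and is not added, while a sample far from all centers gives $\delta_2 \approx \kappa(u, u)$ and enlarges the dictionary.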


3.1.2 Regularized KRLS-ALD 

There are currently two simple ways to deal with KRLS-ALD: 1) setting the regularization term equal to 0 [9]; 2) discarding the samples outside of the center dictionary and ignoring their influence [1]. However, both strategies have disadvantages. With sparsification, the probability of overfitting decreases, and overfitting may not occur at all when the number of retained data is small enough. This is the motivation for setting the regularization term to 0 in method 1. Unfortunately, the success of this method depends on the extent of sparsification: if the approximation level is not large enough, it does not mitigate overfitting. Because it discards useful information, the second method may not achieve satisfactory convergence speed and accuracy. To overcome these drawbacks, we propose a general structure of regularized KRLS-ALD.

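Define the matrices $\Phi(n) = [\varphi(u(1)), \ldots, \varphi(u(n))]$ and $\tilde{\Phi}(n) = [\varphi(c_1), \ldots, \varphi(c_{K(n)})]$. According to [9],

$\Phi(n) = \tilde{\Phi}(n) A(n)^T + \Phi^{res}(n)$ (11)

$\Omega(n+1) = \tilde{\Phi}(n) \alpha(n+1)$ (12)

where $A(n) = [a(1), \ldots, a(n)]^T$ and $\Phi^{res}(n)$ is the residual of the ALD approximation. Then, the cost function becomes

$L(\alpha(n+1)) = \sum_{i=1}^{n} \|\Omega(n+1)^T \varphi(u(i)) - d(i)\|^2 + \lambda \|\Omega(n+1)\|_F^2$

$= \|\Phi(n)^T \tilde{\Phi}(n) \alpha(n+1) - \mathbf{d}(n)\|^2 + \lambda \|\Omega(n+1)\|_F^2$

$\approx \|A(n) \mathbf{K}(n) \alpha(n+1) - \mathbf{d}(n)\|^2 + \lambda \|\tilde{\Phi}(n) \alpha(n+1)\|_F^2$ (13)

where $\mathbf{d}(n) = [d(1), \ldots, d(n)]^T$ and $\mathbf{K}(n) = \tilde{\Phi}(n)^T \tilde{\Phi}(n)$. To minimize $L(\alpha(n+1))$, we take the derivative with respect to $\alpha(n+1)$ and obtain

$\frac{\partial L}{\partial \alpha}(n+1) = 2 (A(n) \mathbf{K}(n))^T (A(n) \mathbf{K}(n) \alpha(n+1) - \mathbf{d}(n)) + 2 \lambda \tilde{\Phi}(n)^T \tilde{\Phi}(n) \alpha(n+1)$ (14)

At the extremum the solution is

$\alpha(n+1) = [A(n)^T A(n) \mathbf{K}(n) + \lambda I]^{-1} A(n)^T \mathbf{d}(n)$ (15)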
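The step from the gradient to the closed-form solution can be checked numerically: substituting the solution of Eq. (15) into the gradient of Eq. (14) should give zero, since the gradient factors as $2\mathbf{K}(n)\{[A(n)^T A(n) \mathbf{K}(n) + \lambda I]\alpha(n+1) - A(n)^T \mathbf{d}(n)\}$. The sketch below uses random stand-ins for $A(n)$ and $\mathbf{d}(n)$ and a Gaussian Gram matrix; all names and data are illustrative assumptions.

```python
import numpy as np

def solve_coefficients(A, K, d, lam):
    """Eq. (15): alpha = [A^T A K + lam*I]^{-1} A^T d."""
    return np.linalg.solve(A.T @ A @ K + lam * np.eye(K.shape[0]), A.T @ d)

def cost_gradient(alpha, A, K, d, lam):
    """Eq. (14), with the dictionary Gram matrix K in place of Phi~^T Phi~."""
    AK = A @ K
    return 2.0 * AK.T @ (AK @ alpha - d) + 2.0 * lam * (K @ alpha)

rng = np.random.default_rng(0)
centers = rng.standard_normal((4, 2))                      # hypothetical dictionary
sq = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq / 2.0)                                      # Gaussian Gram matrix, sigma = 1
A = rng.standard_normal((10, 4))                           # stand-in for ALD coefficients
d = rng.standard_normal(10)                                # stand-in desired responses
alpha = solve_coefficients(A, K, d, lam=0.1)
grad = cost_gradient(alpha, A, K, d, lam=0.1)              # should vanish at the solution
```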

In the online scenario, at each time step we are faced with one of the following two cases.

1. $\varphi(u(n))$ can be approximated by $C(n-1)$ according to ALD, that is, $\delta_2 \leq \delta$. In this case, $C(n) = C(n-1)$.

2. $\delta_2 > \delta$ and the new data $\varphi(u(n))$ is not ALD on $C(n-1)$. Therefore, $C(n) = C(n-1) \cup \{u(n)\}$.


The key issue is how to design an iterative solution to obtain $\alpha(n+1)$. In the following, we denote $P(n) = [A(n)^T A(n) \mathbf{K}(n) + \lambda I]^{-1}$ and derive the KRLS with regularization for each of these two cases.

Dictionary doesn't change: In this case, $\tilde{\Phi}(n) = \tilde{\Phi}(n-1)$ and hence $\mathbf{K}(n) = \mathbf{K}(n-1)$. Only $A$ changes between time steps: $A(n) = [A(n-1)^T, a(n)]^T$. Therefore,

$A(n)^T A(n) = A(n-1)^T A(n-1) + a(n) a(n)^T$ (16)

The matrix $P(n)$ can thus be expressed as $P(n) = [P(n-1)^{-1} + a(n) a(n)^T \mathbf{K}(n)]^{-1}$. Applying the matrix inversion lemma $(A^{-1} + B D^{-1} C)^{-1} = A - AB(D + CAB)^{-1}CA$ with $A = P(n-1)$, $B = a(n)$, $C = a(n)^T \mathbf{K}(n)$ and $D = I$ yields


$P(n) = P(n-1) - \frac{P(n-1) a(n) \, a(n)^T \mathbf{K}(n) \, P(n-1)}{1 + a(n)^T \mathbf{K}(n) P(n-1) a(n)}$ (17)


Defining $q(n) = P(n-1) a(n) / (1 + a(n)^T \mathbf{K}(n) P(n-1) a(n))$ and $s(n) = a(n)^T \mathbf{K}(n)$, the coefficient vector $\alpha(n+1)$ can then be expressed as

$\alpha(n+1) = P(n) A(n)^T \mathbf{d}(n)$

$= \left[ P(n-1) - \frac{P(n-1) a(n) \, a(n)^T \mathbf{K}(n) \, P(n-1)}{1 + a(n)^T \mathbf{K}(n) P(n-1) a(n)} \right] [A(n-1)^T \mathbf{d}(n-1) + a(n) d(n)]$

$= \alpha(n) + \frac{P(n-1) a(n) [d(n) - a(n)^T \mathbf{K}(n) \alpha(n)]}{1 + a(n)^T \mathbf{K}(n) P(n-1) a(n)}$

$= \alpha(n) + q(n) (d(n) - s(n) \alpha(n))$ (18)
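The rank-one recursion for the unchanged-dictionary case can be compared against a direct recomputation of $P(n)$ and $\alpha(n+1)$ from their definitions. The sketch below does this on random stand-in data; the function name `update_same_dictionary` and the test quantities are illustrative assumptions.

```python
import numpy as np

def update_same_dictionary(P_prev, alpha, K, a, d_n):
    """One step of Eqs. (17) and (18) when no center is added."""
    s = a @ K                                    # s(n) = a(n)^T K(n)
    Pa = P_prev @ a
    q = Pa / (1.0 + s @ Pa)                      # q(n)
    P_new = P_prev - np.outer(q, s @ P_prev)     # Eq. (17)
    alpha_new = alpha + q * (d_n - s @ alpha)    # Eq. (18)
    return P_new, alpha_new

rng = np.random.default_rng(1)
centers = rng.standard_normal((3, 2))            # hypothetical dictionary
sq = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq / 2.0)                            # Gaussian Gram matrix, sigma = 1
lam = 0.1
A_prev, d_prev = rng.standard_normal((6, 3)), rng.standard_normal(6)
P_prev = np.linalg.inv(A_prev.T @ A_prev @ K + lam * np.eye(3))
alpha = P_prev @ A_prev.T @ d_prev               # alpha(n) from Eq. (15)
a, d_n = rng.standard_normal(3), 0.7             # new ALD coefficients and response
P_new, alpha_new = update_same_dictionary(P_prev, alpha, K, a, d_n)

# Direct recomputation with the extended A(n) and d(n).
A_new, d_new = np.vstack([A_prev, a]), np.append(d_prev, d_n)
P_direct = np.linalg.inv(A_new.T @ A_new @ K + lam * np.eye(3))
alpha_direct = P_direct @ A_new.T @ d_new
```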


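The size of the center dictionary increases: In this case, $\tilde{\Phi}(n) = [\tilde{\Phi}(n-1), \varphi(u(n))]$. The matrix $A$ changes to

$A(n) = \begin{bmatrix} A(n-1) & 0 \\ 0^T & 1 \end{bmatrix}$ (19)

Therefore,

$A(n)^T A(n) = \begin{bmatrix} A(n-1)^T A(n-1) & 0 \\ 0^T & 1 \end{bmatrix}$ (20)

$P(n)^{-1} = \begin{bmatrix} P(n-1)^{-1} & A(n-1)^T A(n-1) h(n) \\ h(n)^T & \lambda + \kappa_{nn} \end{bmatrix}$ (21)

where $h(n) = [\kappa(c_1, u(n)), \ldots, \kappa(c_{K(n-1)}, u(n))]^T$ and $\kappa_{nn}$ is shorthand for $\kappa(u(n), u(n))$. Using the block matrix inversion identity, we obtain

$P(n) = \gamma(n)^{-1} \begin{bmatrix} P(n-1) \gamma(n) + z_A(n) z(n)^T & -z_A(n) \\ -z(n)^T & 1 \end{bmatrix}$ (22)

where $\gamma(n) = \lambda + \kappa_{nn} - h(n)^T z_A(n)$, $z_A(n) = P(n-1) A(n-1)^T A(n-1) h(n)$, and $z(n) = P(n-1)^T h(n)$. The coefficient vector is then updated as

$\alpha(n+1) = P(n) A(n)^T \mathbf{d}(n)$

$= \gamma(n)^{-1} \begin{bmatrix} P(n-1) \gamma(n) + z_A(n) z(n)^T & -z_A(n) \\ -z(n)^T & 1 \end{bmatrix} \begin{bmatrix} A(n-1)^T \mathbf{d}(n-1) \\ d(n) \end{bmatrix}$

$= \begin{bmatrix} \alpha(n) - z_A(n) \gamma(n)^{-1} e(n) \\ \gamma(n)^{-1} e(n) \end{bmatrix}$ (23)

where $e(n) = d(n) - h(n)^T \alpha(n)$.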
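The block update for a growing dictionary can be verified in the same way, by comparing Eqs. (22) and (23) with a direct inversion of the enlarged system. The sketch below keeps $A(n-1)^T A(n-1)$ as a single matrix `AtA_prev`; that choice, the function name and the random data are assumptions for illustration.

```python
import numpy as np

def update_grown_dictionary(P_prev, alpha, AtA_prev, h, kappa_nn, d_n, lam):
    """One step of Eqs. (22) and (23) when u(n) joins the dictionary."""
    z_A = P_prev @ (AtA_prev @ h)                # z_A(n)
    z = P_prev.T @ h                             # z(n)
    gamma = lam + kappa_nn - h @ z_A             # gamma(n)
    e = d_n - h @ alpha                          # e(n)
    m = len(alpha)
    P_new = np.empty((m + 1, m + 1))
    P_new[:m, :m] = P_prev + np.outer(z_A, z) / gamma
    P_new[:m, m] = -z_A / gamma
    P_new[m, :m] = -z / gamma
    P_new[m, m] = 1.0 / gamma
    alpha_new = np.append(alpha - z_A * e / gamma, e / gamma)
    return P_new, alpha_new

rng = np.random.default_rng(2)
points = rng.standard_normal((4, 2))             # three old centers plus the new one
sq = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
K_full = np.exp(-sq / 2.0)                       # Gaussian Gram matrix, sigma = 1
K_prev, h, kappa_nn = K_full[:3, :3], K_full[:3, 3], K_full[3, 3]
lam = 0.1
A_prev, d_prev = rng.standard_normal((6, 3)), rng.standard_normal(6)
AtA_prev = A_prev.T @ A_prev
P_prev = np.linalg.inv(AtA_prev @ K_prev + lam * np.eye(3))
alpha = P_prev @ A_prev.T @ d_prev
d_n = -0.4
P_new, alpha_new = update_grown_dictionary(P_prev, alpha, AtA_prev, h, kappa_nn, d_n, lam)

# Direct recomputation with A(n) from Eq. (19).
A_new = np.block([[A_prev, np.zeros((6, 1))], [np.zeros((1, 3)), np.ones((1, 1))]])
d_new = np.append(d_prev, d_n)
P_direct = np.linalg.inv(A_new.T @ A_new @ K_full + lam * np.eye(4))
alpha_direct = P_direct @ A_new.T @ d_new
```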


We have now obtained a recursive algorithm for KRLS with regularization, which is described in pseudocode in Algorithm 1. Table 1 summarizes the computational costs per iteration for KRLS-ALD with and without regularization.

Algorithm 1 Kernel RLS with regularization algorithm

Initialization: Select the threshold $\delta > 0$ and the regularization parameter $\lambda > 0$,
    $C(1) = \{u(1)\}$, $P(1) = [\kappa_{11} + \lambda]^{-1}$, $\mathbf{K}(1) = \kappa_{11}$,
    $\mathbf{K}(1)^{-1} = \kappa_{11}^{-1}$, $A(1) = 1$, $\alpha(2) = P(1) d(1)$
for n = 2, 3, ... do
    Get the new sample $(u(n), d(n))$
    Compute $h(n)$
    ALD test: $a(n) = \mathbf{K}(n-1)^{-1} h(n)$
              $\delta_2(n) = \kappa_{nn} - h(n)^T a(n)$
    if $\delta_2(n) \leq \delta$ then
        $C(n) = C(n-1)$, $\mathbf{K}(n) = \mathbf{K}(n-1)$, $\mathbf{K}(n)^{-1} = \mathbf{K}(n-1)^{-1}$
        Compute $\alpha(n+1)$ (Eq. 18)
        Update $P(n)$ (Eq. 17)
        Update $A(n)$: $A(n) = [A(n-1)^T, a(n)]^T$
    else
        $C(n) = C(n-1) \cup \{u(n)\}$
        Compute $\alpha(n+1)$ (Eq. 23)
        Update $P(n)$ (Eq. 22)
        Update $\mathbf{K}(n)^{-1}$ and $\mathbf{K}(n)$
        Update $A(n)$ (Eq. 19)
    end if
end for
return $\alpha(n+1)$, $C(n)$
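A compact sketch of Algorithm 1 is given below. It is only one possible reading of the pseudocode: it assumes a Gaussian kernel, stores $A(n)^T A(n)$ instead of $A(n)$ (only that product appears in the updates), and re-solves the ALD system with `np.linalg.solve` rather than updating $\mathbf{K}(n)^{-1}$ recursively. The function names and the sine-wave data are illustrative.

```python
import numpy as np

def kernel(u, v, sigma=1.0):
    diff = u - v
    return float(np.exp(-(diff @ diff) / (2.0 * sigma ** 2)))

def krls_ald(U, d, delta=1e-4, lam=1e-2, sigma=1.0):
    """Regularized KRLS-ALD; returns the center dictionary and coefficients."""
    centers = [U[0]]
    K = np.array([[kernel(U[0], U[0], sigma)]])
    P = np.array([[1.0 / (K[0, 0] + lam)]])               # P(1)
    AtA = np.array([[1.0]])                               # A(1)^T A(1)
    alpha = P[:, 0] * d[0]                                # P(1) d(1)
    for n in range(1, len(U)):
        u = U[n]
        h = np.array([kernel(c, u, sigma) for c in centers])
        k_nn = kernel(u, u, sigma)
        a = np.linalg.solve(K, h)                         # Eq. (9)
        if k_nn - h @ a <= delta:                         # ALD test, Eq. (10)
            s = a @ K
            Pa = P @ a
            q = Pa / (1.0 + s @ Pa)
            alpha = alpha + q * (d[n] - s @ alpha)        # Eq. (18)
            P = P - np.outer(q, s @ P)                    # Eq. (17)
            AtA = AtA + np.outer(a, a)                    # Eq. (16)
        else:
            m = len(centers)
            z_A = P @ (AtA @ h)
            z = P.T @ h
            gamma = lam + k_nn - h @ z_A
            e = d[n] - h @ alpha
            P_new = np.empty((m + 1, m + 1))              # Eq. (22)
            P_new[:m, :m] = P + np.outer(z_A, z) / gamma
            P_new[:m, m] = -z_A / gamma
            P_new[m, :m] = -z / gamma
            P_new[m, m] = 1.0 / gamma
            P = P_new
            alpha = np.append(alpha - z_A * e / gamma, e / gamma)   # Eq. (23)
            K = np.block([[K, h[:, None]], [h[None, :], np.array([[k_nn]])]])
            AtA = np.block([[AtA, np.zeros((m, 1))],
                            [np.zeros((1, m)), np.ones((1, 1))]])
            centers.append(u)
    return np.array(centers), alpha

def predict(centers, alpha, u, sigma=1.0):
    """Filter output sum_j alpha_j kappa(c_j, u)."""
    return sum(a_j * kernel(c, u, sigma) for a_j, c in zip(alpha, centers))

U = np.linspace(-3.0, 3.0, 200)[:, None]                  # hypothetical inputs
d = np.sin(U[:, 0])                                       # hypothetical responses
centers, alpha = krls_ald(U, d)
mse = np.mean([(predict(centers, alpha, u) - t) ** 2 for u, t in zip(U, d)])
```

In this example the dictionary stays much smaller than the number of samples, which is the purpose of the ALD test.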


Table 1: Computational costs per iteration

Operation             Cost
ALD test              $O(K(n)^2)$
Update $P(n)$         $O(K(n)^2)$
Update $\alpha(n)$    $O(K(n))$


3.2 Kernel Least Mean Square Algorithm 


KLMS uses gradient descent to search for the optimal solution of KRLS and is the least mean square (LMS) algorithm in the RKHS. KLMS is obtained by minimizing the instantaneous cost function

$J(n) = \frac{1}{2} e(n)^2 = \frac{1}{2} |\Omega(n)^T \varphi(u(n)) - d(n)|^2$ (24)

Assuming the initial weight is $\Omega(1) = 0$ and applying the LMS algorithm in the RKHS yields

$\Omega(n+1) = \Omega(n) + \eta e(n) \varphi(u(n))$ (25)

However, the dimensionality of $\varphi(\cdot)$ is very high, so an alternative formulation is needed. Unrolling the recursion gives

$\Omega(n+1) = \Omega(n) + \eta e(n) \varphi(u(n))$

$= \Omega(n-1) + \eta e(n-1) \varphi(u(n-1)) + \eta e(n) \varphi(u(n))$

$= \ldots$

$= \eta \sum_{i=1}^{n} e(i) \varphi(u(i))$

According to the “kernel trick”, the system output for the new input $u(n)$ can be expressed as

$\Omega(n)^T \varphi(u(n)) = \left[ \eta \sum_{i=1}^{n-1} e(i) \varphi(u(i))^T \right] \varphi(u(n))$ (26)

$= \eta \sum_{i=1}^{n-1} e(i) \kappa(u(i), u(n))$ (27)
In conclusion, the learning rule of KLMS in the original space is as shown in Algorithm 2.


Algorithm 2 Kernel least mean square algorithm
Initialization: Select the step size $\eta$,
    $y(1) = 0$;
    $e(1) = d(1)$;
Computation
    while $\{u(n), d(n)\}$ available do
        $y(n) = \eta \sum_{i=1}^{n-1} e(i) \kappa(u(i), u(n))$;
        $e(n) = d(n) - y(n)$;
    end while


Among all of the kernel adaptive filters, KLMS is unique. It provides a well-posed solution with finite data [10] and naturally creates a growing radial-basis function (RBF) network [1]. Moreover, as an online learning algorithm, KLMS is much simpler to implement, with respect to computational complexity and memory storage, than batch-mode kernel methods.
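The KLMS learning rule translates almost line by line into code. The sketch below assumes a Gaussian kernel and a step size of 0.5; the function name `klms`, the returned `predict` closure and the synthetic data are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    diff = u - v
    return float(np.exp(-(diff @ diff) / (2.0 * sigma ** 2)))

def klms(U, d, eta=0.5, sigma=1.0):
    """One pass of KLMS; returns the a priori errors and a predictor."""
    errors = np.zeros(len(U))
    for n in range(len(U)):
        # Output from all previous samples, Eq. (27); the empty sum gives y = 0 at n = 0.
        y = eta * sum(errors[i] * gaussian_kernel(U[i], U[n], sigma) for i in range(n))
        errors[n] = d[n] - y

    def predict(u):
        return eta * sum(errors[i] * gaussian_kernel(U[i], u, sigma) for i in range(len(U)))

    return errors, predict

rng = np.random.default_rng(0)
U = rng.uniform(-3.0, 3.0, size=(300, 1))        # hypothetical inputs
d = np.sin(U[:, 0])                              # hypothetical nonlinear system
errors, predict = klms(U, d)
early_mse = np.mean(errors[:50] ** 2)
late_mse = np.mean(errors[-50:] ** 2)
```

Every sample becomes a center here, so the cost of one prediction grows linearly with the number of processed samples.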

3.3 Kernel Affine Projection Algorithm 

The KLMS algorithm uses only the current sample to search for the optimal solution, whereas the affine projection algorithm (APA) obtains a better approximation by using the K most recent samples and provides a trade-off between computational complexity and system performance. More interestingly, KAPA provides a general framework for several existing techniques, including KLMS, sliding-window KRLS and kernel regularization networks [1].


4 Discussion 

Even though kernel adaptive filters have the advantages mentioned above and have proved useful in complicated nonlinear regression and classification problems, they still have some drawbacks. For example, the conventional cost function, the mean-square error (MSE) criterion, cannot obtain the best performance in non-Gaussian situations. Moreover, the structure grows linearly with each new sample, so computational complexity and memory increase linearly with time as more samples are processed. This is a particular problem in continuous scenarios and has so far hindered online applications such as DSP and FPGA implementations. Therefore, to apply these powerful kernel adaptive filters in practice, we first need to address two main issues:

• How to obtain a robust performance in non-Gaussian situations?

• How to decrease the growing computational burden and memory requirement?

These two issues are important aspects for judging system performance and are highly related. Improving robustness normally increases the computational complexity, so appropriate measures to decrease computational complexity, such as reducing the network size, are necessary. On the other hand, approximation techniques that decrease computational complexity may degrade system performance, which can be compensated by improving the robustness of the system. Throughout the literature, similar problems have been studied from different perspectives, leading to a myriad of techniques. Yet few of these techniques have been adapted to kernel adaptive filters. Our goal is to provide appropriate methods that take into account the intrinsic properties of kernel adaptive filters to solve these two problems.

Various factors influence the performance of kernel adaptive filters, such as the cost function and the kernel function. First, let us consider the cost function. The MSE criterion is equivalent to the maximum likelihood technique under linear and Gaussian conditions and obtains good performance there, but it is not sufficient in other scenarios. The least mean p-power error (MPE) criterion, which builds a linear weighted combination of various powers of the prediction error, has been investigated to solve this problem. However, its many free parameters and prior knowledge requirements limit its wide application in practice. Recently, information theoretic learning (ITL) has proved more efficient for training adaptive systems. Unlike a conventional error cost function, ITL optimizes the information content of the prediction error to achieve the best performance in terms of information filtering. By taking into account the whole signal distribution, adaptive systems trained with information theoretic criteria perform better in various applications, especially those for which the Gaussianity assumption is not valid. Furthermore, the nonparametric nature of the cost function and its clear physical meaning motivate us to introduce ITL into kernel adaptive filters to achieve a robust system and to propose the kernel maximum correntropy (KMC) algorithm.
Even though the accuracy of a system is improved, we still have to face the redundant computational and memory burden. The other important problem for the practical application of kernel adaptive filters is therefore compacting the network structure, which decreases the total computational complexity. By including the important information and discarding relatively less useful data, growing and pruning techniques keep the network size within an acceptable range. Unlike previous sparsification techniques, the quantized kernel least mean square (QKLMS) algorithm uses the “redundant” data to update the network rather than purely discarding them, and so obtains a compact system with higher accuracy. As a complement to sparsification, pruning strategies make the system structure more compact. Pruning strategies based on sensitivity analysis, which discard the units that make an insignificant contribution to the overall network, are robust and widely used in various areas. Generally, omitting any information reduces accuracy. Hence, growing and pruning techniques are a trade-off between system performance and a compact structure. A measure of significance is therefore adopted in kernel adaptive filters to estimate the influence of the processed data and decide what information will be discarded.
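The idea of using “redundant” data instead of discarding it can be sketched as follows. This is a minimal illustration of the quantization approach, not the exact algorithm referred to in the text: it assumes a Gaussian kernel, a Euclidean quantization radius `eps`, and the rule that a sample falling within `eps` of its nearest center updates that center's coefficient, while any other sample becomes a new center.

```python
import numpy as np

def qklms(U, d, eta=0.5, sigma=1.0, eps=0.2):
    """Quantized KLMS sketch; returns the centers and their coefficients."""
    centers = [U[0]]
    coeffs = [eta * d[0]]                        # first error is d(1), since y(1) = 0
    for n in range(1, len(U)):
        C = np.array(centers)
        dists = np.linalg.norm(C - U[n], axis=1)
        k = np.exp(-dists ** 2 / (2.0 * sigma ** 2))
        e = d[n] - np.dot(coeffs, k)             # prediction error for the new sample
        j = int(np.argmin(dists))
        if dists[j] <= eps:
            coeffs[j] += eta * e                 # merge the update into the nearest center
        else:
            centers.append(U[n])                 # otherwise grow the network
            coeffs.append(eta * e)
    return np.array(centers), np.array(coeffs)

rng = np.random.default_rng(0)
U = rng.uniform(-3.0, 3.0, size=(300, 1))        # hypothetical inputs
d = np.sin(U[:, 0])                              # hypothetical desired responses
centers, coeffs = qklms(U, d)
```

Because any two centers are more than `eps` apart, the network size is bounded by the extent of the input domain rather than by the number of samples.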

The significance measure guides the system to fix the network size at a predefined threshold. However, this is not truly nonstationary learning. What we expect is that the network size is optimally dictated by the complexity of the true system and the signal, while the system accuracy remains acceptable. If we solve this problem, we have a truly online algorithm to handle real-world nonstationary problems. The problem thus becomes how to obtain a trade-off between system complexity and accuracy. A variety of information criteria have been proposed to deal with this compromise. Among them, Minimal Description Length (MDL) has the advantages of small computational cost and robustness against noise, which have been demonstrated in many applications, particularly in the information learning area. The MDL criterion has two formulations: the batch model and the online model. Taking the approximation level selection in KRLS-ALD as an example, the batch-model MDL in kernel adaptive filters is illustrated first. We then propose a KLMS sparsification algorithm to explain the online-model MDL. Because the proposed algorithm partitions the input (feature) space with quantization techniques, it is called QKLMS-MDL.

This article mainly focuses on the improvement of KLMS, the simplest of the kernel adaptive filters, but we believe that further extensions to other kernel adaptive filters, such as the kernel recursive least squares (KRLS) algorithm, will broaden the scope of the applications currently addressed with KLMS.


References 

[1] W. Liu, J. Principe and S. Haykin, Kernel Adaptive Filtering: A Comprehensive Introduction, Wiley, 2010.

[2] Y. Huang and R. P. N. Rao, “Predictive coding,” Wiley Interdisciplinary Reviews: Cognitive Science, vol. 2, no. 5, pp. 580-593, 2011.

[3] J. Wu, R. Tse, C. L. Heike and L. G. Shapiro, “Learning to compute the symmetry plane for human faces,” ACM Conference on Bioinformatics, Computational Biology and Health Informatics, 2011.

[4] I. Steinwart, “On the influence of the kernel on the consistency of support vector machines,” Journal of Machine Learning Research, vol. 2, pp. 67-93, 2001.

[5] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.

[6] F. Girosi, M. Jones and T. Poggio, “Regularization theory and neural networks architectures,” Neural Computation, vol. 7, pp. 219-269, 1995.

[7] B. Scholkopf, A. Smola and K. Muller, “Nonlinear component analysis as a kernel eigenvalue problem,” Neural Computation, vol. 10, no. 5, pp. 1299-1319, Jul. 1998.

[8] M. Sebastian, R. Gunnar, W. Jason and S. Bernhard, “Fisher Discriminant Analysis With Kernels,” Neural Networks for Signal Processing, pp. 41-48, Aug. 1999.

[9] Y. Engel, S. Mannor and R. Meir, “The Kernel Recursive Least-Squares Algorithm,” IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2275-2285, 2004.

[10] W. Liu, P. Pokharel and J. Principe, “The kernel least mean square algorithm,” IEEE Transactions on Signal Processing, vol. 56, iss. 2, pp. 543-554, 2008.

[11] E. Parzen, “On the estimation of a probability density function and the mode,” The Annals of Mathematical Statistics, vol. 33, no. 3, pp. 1065-1076, 1962.

[12] W. Liu, P. Pokharel and J. Principe, “Correntropy: Properties and Applications in Non-Gaussian Signal Processing,” IEEE Transactions on Signal Processing, vol. 55, iss. 11, pp. 5286-5298, 2007.

[13] J. Xu, P. Pokharel, A. Paiva and J. Principe, “Nonlinear Component Analysis Based on Correntropy,” International Joint Conference on Neural Networks, pp. 1851-1855, 2006.

[14] A. Gunduz and J. Principe, “Correntropy as a Novel Measure for Nonlinearity Tests,” Signal Processing, vol. 89, iss. 1, pp. 14-23, Jan. 2009.

[15] Y. Huang and R. P. Rao, “Neurons as Monte Carlo samplers: Bayesian inference and learning in spiking networks,” Advances in Neural Information Processing Systems (NIPS), pp. 1943-1951, 2014.

[16] K. Jeong and J. Principe, “The Correntropy MACE Filter for Image Recognition,” IEEE International Workshop on Machine Learning for Signal Processing, pp. 9-14, Sep. 2006.

[17] I. Park and J. Principe, “Correntropy based Granger Causality,” International Conference on Acoustics, Speech and Signal Processing, pp. 3605-3608, Mar. 2008.

[18] A. Singh and J. Principe, “Using correntropy as a cost function in linear adaptive filters,” Proceedings of International Joint Conference on Neural Networks, pp. 2950-2955, Jun. 2009.

[19] A. Singh and J. Principe, “A loss function for classification based on a robust similarity metric,” Proceedings of International Joint Conference on Neural Networks, pp. 1-6, Jul. 2010.

[20] I. Santamaria, P. Pokharel and J. Principe, “Generalized correlation function: Definition, properties and application to blind equalization,” IEEE Transactions on Signal Processing, vol. 54, no. 6, pp. 2187-2197, 2006.

[21] C. Micchelli, Y. Xu, H. Zhang and G. Lugosi, “Universal Kernel,” The Journal of Machine Learning Research, vol. 7, pp. 2651-2667, 2006.

[22] A. Singh and J. Principe, “Information theoretic learning with adaptive kernels,” Signal Processing, vol. 91, iss. 2, pp. 203-213, Feb. 2011.

[23] F. Iannacci and Y. Huang, “ChirpCast: Data Transmission via Audio,” UW technical report, 2015.

[24] J. Wu, R. Tse and L. G. Shapiro, “Automated face extraction and normalization of 3D mesh data,” 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 750-753, 2014.

[25] N. Wang, “Information retrieval using Markov decision process,” International Journal of Computer Systems, vol. 2, no. 6, pp. 307-311, 2015.

[26] J. Wu, R. Tse and L. G. Shapiro, “Learning to rank the severity of unrepaired cleft lip nasal deformity on 3D mesh data,” 22nd International Conference on Pattern Recognition (ICPR), pp. 460-464, 2014.





